Cleaning uncertain data with quality guarantees
Authors
Abstract
Uncertain or imprecise data are pervasive in applications such as location-based services, sensor monitoring, and data collection and integration. For these applications, probabilistic databases can be used to store uncertain data, and querying facilities are provided to yield answers with statistical confidence. Given that a limited amount of resources is available to “clean” the database (e.g., by probing some sensor data values to get their latest values), we address the problem of choosing the set of uncertain objects to be cleaned, in order to achieve the best improvement in the quality of query answers. For this purpose, we present the PWS-quality metric, which is a universal measure that quantifies the ambiguity of query answers under the possible world semantics. We study how PWS-quality can be efficiently evaluated for two major query classes: (1) queries that examine the satisfiability of tuples independent of other tuples (e.g., range queries); and (2) queries that require the knowledge of the relative ranking of the tuples (e.g., MAX queries). We then propose a polynomial-time solution to achieve an optimal improvement in PWS-quality. Other fast heuristics are presented as well. Experiments, performed on both real and synthetic datasets, show that the PWS-quality metric can be evaluated quickly, and that our cleaning algorithm provides an optimal solution with high efficiency. To the best of our knowledge, this is the first work that develops a quality metric for a probabilistic database, and investigates how such a metric can be used for data cleaning purposes.
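As a rough illustration of what "ambiguity of query answers under the possible world semantics" can mean, the sketch below enumerates the possible worlds of a tiny set of independent uncertain objects, groups them by the answer a range query returns, and scores the resulting distribution. The abstract does not give the metric's formula, so the negated-entropy score used here is an assumption made for illustration. The names `result_distribution`, `ambiguity_score`, and the sensor data are hypothetical. The brute-force enumeration is exponential in the number of objects and is not the efficient evaluation the abstract refers to.

```python
from itertools import product
from math import log2, prod

def result_distribution(objects, lo, hi):
    """Map each distinct range-query answer to its total probability."""
    names = list(objects)
    dist = {}
    # One possible world = one (value, probability) choice per object.
    for world in product(*(objects[n] for n in names)):
        p = prod(prob for _, prob in world)
        answer = frozenset(
            n for n, (value, _) in zip(names, world) if lo <= value <= hi
        )
        dist[answer] = dist.get(answer, 0.0) + p
    return dist

def ambiguity_score(dist):
    """Assumed score: negated Shannon entropy in bits (0 = one certain answer)."""
    return sum(p * log2(p) for p in dist.values() if p > 0)

# Hypothetical data: s1 is uncertain, s2 is already certain.
sensors = {
    "s1": [(18.0, 0.5), (25.0, 0.5)],
    "s2": [(21.0, 1.0)],
}
before = ambiguity_score(result_distribution(sensors, 20.0, 30.0))

# "Cleaning" s1: assume a probe reveals that its value is 25.0.
cleaned = dict(sensors, s1=[(25.0, 1.0)])
after = ambiguity_score(result_distribution(cleaned, 20.0, 30.0))
```

Under this assumed score, cleaning the only uncertain object raises the value to 0, its maximum, because a single answer remains. With a limited budget, the selection problem described in the abstract is to choose which objects to probe so that this kind of improvement is as large as possible.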
Similar resources
Towards Expertise Modelling for Routing Data Cleaning Tasks within a Community of Knowledge Workers
Applications consuming data have to deal with a variety of data quality issues such as missing values, duplication, incorrect values, etc. Although automatic approaches can be utilized for data cleaning, the results can remain uncertain. Therefore, updates suggested by automatic data cleaning algorithms require further human verification. This paper presents an approach for generating tasks for unc...
Data Quality Problems beyond Consistency and Deduplication
Recent work on data quality has primarily focused on data repairing algorithms for improving data consistency and record matching methods for data deduplication. This paper accentuates several other challenging issues that are essential to developing data cleaning systems, namely, error correction with performance guarantees, unification of data repairing and record matching, relative informati...
Continuous Post-Mining of Association Rules in a Data Stream Management System
The real-time (or just-on-time) requirement associated with online association rule mining implies the need to expedite the analysis and validation of the many candidate rules, which are typically created from the discovered frequent patterns. Moreover, the mining process, from data cleaning to post-mining, can no longer be structured as a sequence of steps performed by the analyst, but must be...
Efficiently and effectively processing probabilistic queries on uncertain data
Significance. Driven by many recent applications including social networks, sensor networks, data cleaning and integration, moving objects, image processing, information retrieval, crime control, economic decision making and market surveillance, querying and analyzing uncertain data draws a great deal of research attention from the database community. A number of system prototypes for managing unce...
An optimization model for management of empty containers in distribution network of a logistics company under uncertainty
In transportation via containers, unbalanced movement of loaded containers forces shipping companies to reposition empty containers. This study addresses the problem of empty container repositioning (ECR) in the distribution network of a European logistics company, where some restrictions impose decision making in an uncertain environment. The problem involves dispatching empty contain...
Journal title:
- PVLDB
Volume 1, Issue
Pages -
Publication date: 2008